Assessment Task - Predicting Categories of Bank Transaction Data ¶

Assessment Task Summary¶

  • In this assessment, we need to build a model that classifies transactions into the right financial categories. We will conduct a detailed analysis of the data, create a system for transforming features (including text and numbers), train various models, and develop a solution that we can explain and improve over time.

Table of Contents¶

  • Problem Statement
  • Import Required Libraries
  • Load the Dataset
  • Understanding Data
  • Exploratory Data Analysis
  • Feature Engineering
  • Models Building, Training, and Evaluation
  • Models Performance Comparison and Results Interpretation
  • Model Explainability and Interpretability using LIME and PDPs
  • Prediction Based on New User Data
  • Steps to Enhance Model Performance

Problem Statement¶

  • MoneyLion wants to help people understand and manage their money better by classifying their bank transactions. Each transaction can fit into categories like “Loans,” “Transfers,” or “Restaurants.” Our aim is to create a system that correctly assigns these categories to new transactions, helping users make better financial choices.

Import Required Libraries¶

In [72]:
# Basic libraries
import re
import string
import warnings
import ssl
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# NLTK and text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet as wn

# Visualization tools
from wordcloud import WordCloud, STOPWORDS
from sklearn.inspection import PartialDependenceDisplay

# Feature extraction and dimensionality reduction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Preprocessing
from sklearn.preprocessing import LabelEncoder, RobustScaler, label_binarize
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Machine learning models
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Clustering
from sklearn.cluster import KMeans

# Metrics and evaluation
from sklearn.metrics import (
    roc_curve, auc, classification_report, roc_auc_score, confusion_matrix
)

# Model selection
from sklearn.model_selection import train_test_split

# Lime for explainability
from lime.lime_tabular import LimeTabularExplainer

# Warnings
warnings.filterwarnings('ignore')

# SSL context adjustment for NLTK downloads (if necessary)
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# NLTK downloads and setup
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

# NLTK utilities
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/ehtishamsadiq/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ehtishamsadiq/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ehtishamsadiq/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ehtishamsadiq/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Load the Dataset¶

In [73]:
bank_transactions = pd.read_csv("bank_transaction.csv")
user_profiles = pd.read_csv("user_profile.csv")

Understanding Data¶

In [74]:
print(f"Number of rows and columns in bank_transactions: {bank_transactions.shape}")
print(f"Number of rows and columns in user_profiles: {user_profiles.shape}")
Number of rows and columns in bank_transactions: (258779, 8)
Number of rows and columns in user_profiles: (1000, 7)
In [75]:
# top sample records of bank_transactions
bank_transactions.sample(10)
Out[75]:
client_id bank_id account_id txn_id txn_date description amount category
202957 880 804 925 33221 2023-09-25 00:00:00 PURCHASE 0922 ZIP.CO* Maryse Hemant NY 3168316... -4.700 Gas Stations
30326 315 1 1 91 2023-06-22 00:00:00 THE MYRON STRATT Payroll 230622 968 Meaghan Pr... 88.936 Payroll
251965 880 530 608 15634 2023-09-12 00:00:00 CHECK111 -13.104 Supermarkets and Groceries
175481 880 644 741 185616 2023-07-03 00:00:00 KROGER #0 1025 07/01 #Maryse Hemant KROGER #0 ... -3.932 Supermarkets and Groceries
120206 880 619 714 16864 2023-09-21 00:00:00 Purchase SHELL SERVICE S NORTH BEND WAUS -1.494 Convenience Stores
128889 880 399 449 141264 2023-08-18 08:19:38 Mars Shave Ice, LLC -1.430 Supermarkets and Groceries
62420 755 1 1 50 2023-09-15 00:00:00 CASH APP*WIIPE*CASH OUSan FranciscoUS 15.064 Third Party
223101 880 259 291 119477 2023-07-03 00:00:00 Pos Debit- 9774 9774 Cash App*sendmyshx 103... -2.000 Third Party
158582 880 481 547 49114 2023-07-13 00:00:00 DEBIT CARD PURCHASE 3168 VENMO* Visa Direct NY -4.000 Third Party
237652 880 788 906 156641 2023-07-02 00:00:00 ATM Withdrawal ATM MIDTOWN PLAZA 2 4242 M... -12.000 ATM

Observation:

  • Negative values in the amount column usually mean money is being spent or leaving the account. For example, a transaction at McDonald's categorized as Restaurants or at MEIJER under Supermarkets and Groceries indicates an expense.

  • Positive values represent money being added to the account. These could be deposits, refunds, or payroll credits. For example, a VISA Money Transfer Credit categorized as Payroll suggests salary or income.
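The sign convention described above can be turned into explicit model features. A minimal sketch on hypothetical rows (the column names mirror the dataset; the sample values are illustrative):

```python
import pandas as pd

# Hypothetical sample mirroring the sign convention described above
txns = pd.DataFrame({
    "description": ["MCDONALD'S #123", "VISA Money Transfer Credit"],
    "amount": [-8.50, 250.00],
})

# Negative amounts are outflows (debits); positive amounts are inflows (credits)
txns["is_credit"] = txns["amount"] > 0
txns["abs_amount"] = txns["amount"].abs()
print(txns[["amount", "is_credit", "abs_amount"]])
```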

In [76]:
# sample ten records of user_profiles
user_profiles.sample(10)
Out[76]:
CLIENT_ID IS_INTERESTED_INVESTMENT IS_INTERESTED_BUILD_CREDIT IS_INTERESTED_INCREASE_INCOME IS_INTERESTED_PAY_OFF_DEBT IS_INTERESTED_MANAGE_SPENDING IS_INTERESTED_GROW_SAVINGS
410 411 False False False True False False
929 930 False False False False False False
860 861 False False False False False False
32 33 False False False False False False
491 492 False False False False False False
216 217 False False False True False False
159 160 False False False False False False
484 485 False False False False False False
510 511 False False False False False False
184 185 False False False False False False
In [77]:
# descriptive statistics of the amount column in bank_transactions
bank_transactions['amount'].describe()
Out[77]:
count    258779.000000
mean          2.544952
std          81.132139
min       -9162.460000
25%          -6.000000
50%          -1.876000
75%           2.000000
max        9397.830000
Name: amount, dtype: float64

Observation:

  • The amount column spans a wide range, from large negative to large positive values, with a large standard deviation. A few very large transactions are mixed with many small ones, which could hurt model performance. To handle this, we can transform the data to reduce the impact of extreme values or apply outlier-handling techniques.
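One outlier-handling option mentioned above is winsorizing, i.e., clipping values to chosen quantiles. A hedged sketch on synthetic heavy-tailed data (the 1%/99% cutoffs are illustrative, not tuned):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heavy-tailed amounts standing in for the real column
amounts = pd.Series(np.concatenate([rng.normal(0, 10, 1000), [9000.0, -9000.0]]))

# Winsorize: clip to the 1st/99th percentiles to tame extreme outliers
lo, hi = amounts.quantile([0.01, 0.99])
clipped = amounts.clip(lower=lo, upper=hi)
print(f"std before: {amounts.std():.1f}, after: {clipped.std():.1f}")
```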

Sanity Checks¶

1. Check missing values in both data frames

In [79]:
bank_transactions.isna().sum()
Out[79]:
client_id        0
bank_id          0
account_id       0
txn_id           0
txn_date         0
description      0
amount           0
category       257
dtype: int64

Observations:

  • There are only 257 missing values in the category column.
In [80]:
bank_transactions[bank_transactions['category'].isna()].sample(2)
Out[80]:
client_id bank_id account_id txn_id txn_date description amount category
114185 880 862 994 124820 2023-08-08 19:00:00 Cash App*Maryse Hemant -4.20 NaN
64028 788 1 1 94 2023-07-30 19:00:00 Cash app*cash out visa direct caus 1.55 NaN
In [81]:
user_profiles.isna().sum() # no missing values
Out[81]:
CLIENT_ID                        0
IS_INTERESTED_INVESTMENT         0
IS_INTERESTED_BUILD_CREDIT       0
IS_INTERESTED_INCREASE_INCOME    0
IS_INTERESTED_PAY_OFF_DEBT       0
IS_INTERESTED_MANAGE_SPENDING    0
IS_INTERESTED_GROW_SAVINGS       0
dtype: int64

2. Check consistency of the data types in both data frames

In [82]:
print(f"Data-types of bank_transactions:\n{bank_transactions.dtypes}\n\n")

print(f"Data-types of user_profiles:\n{user_profiles.dtypes}")
Data-types of bank_transactions:
client_id        int64
bank_id          int64
account_id       int64
txn_id           int64
txn_date        object
description     object
amount         float64
category        object
dtype: object


Data-types of user_profiles:
CLIENT_ID                        int64
IS_INTERESTED_INVESTMENT          bool
IS_INTERESTED_BUILD_CREDIT        bool
IS_INTERESTED_INCREASE_INCOME     bool
IS_INTERESTED_PAY_OFF_DEBT        bool
IS_INTERESTED_MANAGE_SPENDING     bool
IS_INTERESTED_GROW_SAVINGS        bool
dtype: object

Observation:

  • The data type of the txn_date column is object instead of datetime, so it should be converted before extracting date features.
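Converting the column is a one-liner with pandas. A self-contained sketch using strings in the same format as txn_date (the "bad value" row is a hypothetical to show coercion):

```python
import pandas as pd

# Hypothetical strings in the same format as txn_date
df = pd.DataFrame({"txn_date": ["2023-09-25 00:00:00", "2023-08-18 08:19:38", "bad value"]})

# errors='coerce' turns unparseable strings into NaT instead of raising
df["txn_date"] = pd.to_datetime(df["txn_date"], errors="coerce")
print(df["txn_date"].dtype)
print(df["txn_date"].isna().sum())
```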

3. Check duplicate values

In [83]:
print(f"Number of duplicates in bank_transactions: {bank_transactions.duplicated().sum()}") # no duplicates
print(f"Number of duplicates in user_profiles: {user_profiles.duplicated().sum()}") # no duplicates
Number of duplicates in bank_transactions: 0
Number of duplicates in user_profiles: 0

4. Check unique values

In [84]:
print(f"Unique values in bank_transactions:\n{bank_transactions.nunique()}\n\n")

print(f"Unique values in user_profiles:\n{user_profiles.nunique()}")
Unique values in bank_transactions:
client_id         880
bank_id           990
account_id       1131
txn_id         190505
txn_date         7183
description    102108
amount          29120
category           33
dtype: int64


Unique values in user_profiles:
CLIENT_ID                        1000
IS_INTERESTED_INVESTMENT            2
IS_INTERESTED_BUILD_CREDIT          2
IS_INTERESTED_INCREASE_INCOME       2
IS_INTERESTED_PAY_OFF_DEBT          2
IS_INTERESTED_MANAGE_SPENDING       2
IS_INTERESTED_GROW_SAVINGS          2
dtype: int64

Observation

  • The description column contains 102,108 unique values across 258,779 rows, so many descriptions repeat across multiple records; these repeats are useful signal for text-based features.
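To see which descriptions repeat most often, value_counts on the column is enough. A sketch on a few hypothetical descriptions (on the real data this would be merged_df['description'].value_counts()):

```python
import pandas as pd

# Hypothetical descriptions; the real data would use merged_df['description']
desc = pd.Series([
    "CHECK CARD REFUND", "CHECK CARD REFUND", "Empower RTP CREDIT",
    "CHECK CARD REFUND", "Empower RTP CREDIT", "From Savings - 7762",
])

# value_counts sorts by frequency, surfacing the most repeated descriptions first
top_repeats = desc.value_counts()
print(top_repeats.head())
```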

5. Convert column names to lowercase

In [85]:
user_profiles.columns = user_profiles.columns.str.lower()

Merging Dataframes on client_id Column¶

  • Both data frames contain a client_id column, which we can use to merge our data frames into one.
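Before committing to a left merge, pandas' indicator flag can verify how many transactions lack a profile row. A sketch on two hypothetical frames:

```python
import pandas as pd

# Hypothetical frames mirroring bank_transactions and user_profiles
txns = pd.DataFrame({"client_id": [1, 2, 3], "amount": [20.0, -5.0, 12.5]})
profiles = pd.DataFrame({"client_id": [1, 2], "is_interested_investment": [True, False]})

# indicator=True adds a '_merge' column flagging rows without a matching profile
check = pd.merge(txns, profiles, how="left", on="client_id", indicator=True)
unmatched = (check["_merge"] == "left_only").sum()
print(f"transactions without a matching profile: {unmatched}")
```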
In [86]:
# merge two datasets bank_transactions and user_profiles, on column client_id
merged_df = pd.merge(bank_transactions, user_profiles, how='left', on='client_id')
merged_df.head()
Out[86]:
client_id bank_id account_id txn_id txn_date description amount category is_interested_investment is_interested_build_credit is_interested_increase_income is_interested_pay_off_debt is_interested_manage_spending is_interested_grow_savings
0 1 1 1 4 2023-09-29 00:00:00 Earnin PAYMENT Donat... 20.0 Loans False False False False False False
1 1 1 1 3 2023-08-14 00:00:00 ONLINE TRANSFER FROM NDonatas DanyalDA O CARSO... 25.0 Transfer Credit False False False False False False
2 1 1 1 5 2023-09-25 00:00:00 MONEY TRANSFER AUTHOR... 20.0 Loans False False False False False False
3 1 1 2 1 2023-06-02 00:00:00 ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKIN... 16.0 Transfer Credit False False False False False False
4 1 1 2 2 2023-06-01 00:00:00 ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKIN... 4.0 Transfer Credit False False False False False False

Analysis of category column¶

In [87]:
merged_df['category'].value_counts()
Out[87]:
category
Uncategorized                 29392
Third Party                   28714
Restaurants                   26367
Transfer Credit               21561
Loans                         19605
Convenience Stores            18630
Supermarkets and Groceries    16750
Transfer Debit                15114
Gas Stations                  12919
Internal Account Transfer     11983
Payroll                        8100
Shops                          7418
Bank Fees                      6432
Transfer                       6275
ATM                            5672
Transfer Deposit               4976
Digital Entertainment          4525
Utilities                      4118
Clothing and Accessories       3190
Department Stores              2002
Insurance                      1754
Service                         910
Arts and Entertainment          397
Travel                          367
Food and Beverage Services      343
Interest                        280
Check Deposit                   211
Healthcare                      207
Telecommunication Services      159
Gyms and Fitness Centers         69
Payment                          41
Bank Fee                         36
Tax Refund                        5
Name: count, dtype: int64

Observation

  • The category column is dominated by a few categories like Uncategorized, Third Party, and Restaurants, while many others have far fewer entries, indicating a class imbalance that could affect the performance of machine learning models.
  • To make things simpler and balance the categories better, we could think about merging some of them. Here are a few ideas:
    • Bank Fees and Bank Fee: These are similar, so we could just keep one category called "Bank Fees."
    • Transfer, Transfer Credit, Transfer Debit, and Transfer Deposit: Since all of these relate to transfers, we could combine them into a single "Transfers" category.
    • Food and Beverage Services and Restaurants: While they are different, we could merge these into "Food and Dining" if we don’t need to keep the distinction.
    • Digital Entertainment and Arts and Entertainment: These could be combined into an overall "Entertainment" category.
    • Convenience Stores and Supermarkets and Groceries: We could group these into a broader "Retail and Groceries" category if we don’t need to differentiate between them.
    • Utilities and Telecommunication Services: These could easily be merged into one category called "Services."
    • Gyms and Fitness Centers and Healthcare: We could combine these into a broader "Health and Wellness" category.
    • Department Stores and Shops: These could be simplified into a single category called "Retail."
    • Payment and Check Deposit: We could group these into a "Deposits and Payments" category.
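The merges sketched above can also be written as a lookup table instead of an if/elif chain, which is easier to extend. A minimal sketch; only the remapped categories need entries, and the map below is an assumption based on the ideas listed:

```python
import pandas as pd

# Only categories that change need entries; everything else passes through
CATEGORY_MAP = {
    "Bank Fee": "Bank Fees",
    "Food and Beverage Services": "Food and Dining",
    "Restaurants": "Food and Dining",
    "Digital Entertainment": "Entertainment",
    "Arts and Entertainment": "Entertainment",
    "Gyms and Fitness Centers": "Health and Wellness",
    "Healthcare": "Health and Wellness",
}

cats = pd.Series(["Restaurants", "Loans", "Bank Fee"])
merged = cats.map(CATEGORY_MAP).fillna(cats)  # unmapped values kept as-is
print(merged.tolist())
```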

Category Mapping¶

In [88]:
def merge_category(cat):
    if cat in ["Bank Fee", "Bank Fees"]:
        return "Bank Fees"
    
    elif cat in ["Food and Beverage Services", "Restaurants"]:
        return "Food and Dining"
    
    elif cat in ["Digital Entertainment", "Arts and Entertainment"]:
        return "Entertainment"
    
    elif cat in ["Gyms and Fitness Centers", "Healthcare"]:
        return "Health and Wellness"
    else:
        return cat


merged_df['category'] = merged_df['category'].apply(merge_category)
In [89]:
merged_df['category'].value_counts()
Out[89]:
category
Uncategorized                 29392
Third Party                   28714
Food and Dining               26710
Transfer Credit               21561
Loans                         19605
Convenience Stores            18630
Supermarkets and Groceries    16750
Transfer Debit                15114
Gas Stations                  12919
Internal Account Transfer     11983
Payroll                        8100
Shops                          7418
Bank Fees                      6468
Transfer                       6275
ATM                            5672
Transfer Deposit               4976
Entertainment                  4922
Utilities                      4118
Clothing and Accessories       3190
Department Stores              2002
Insurance                      1754
Service                         910
Travel                          367
Interest                        280
Health and Wellness             276
Check Deposit                   211
Telecommunication Services      159
Payment                          41
Tax Refund                        5
Name: count, dtype: int64

Why Should We Consider “Uncategorized” as Test Data?¶

  • I selected the Uncategorized category as the test dataset because I will use it to explore how my model handles unknown or unclassified transactions. By using clustering and visualization techniques like K-Means and t-SNE, I can identify potential groupings within these uncategorized transactions that align with known categories.
In [90]:
test_df = merged_df[(merged_df['category'] == "Uncategorized") | (merged_df['category'].isna())]  # rows with category 'Uncategorized' or missing
print(f"Number of rows in merged_df before dropping rows with category 'Uncategorized': {merged_df.shape[0]}")
merged_df.drop(test_df.index, inplace=True)  # drop those rows from the training data
merged_df.reset_index(drop=True, inplace=True)
print(f"Number of rows in test_df: {test_df.shape[0]}")
print(f"Number of rows in merged_df after dropping rows with category 'Uncategorized': {merged_df.shape[0]}")
Number of rows in merged_df before dropping rows with category 'Uncategorized': 258779
Number of rows in test_df: 29649
Number of rows in merged_df after dropping rows with category 'Uncategorized': 229130
In [91]:
descriptions = test_df['description'].fillna("")
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_uncat = tfidf.fit_transform(descriptions)

kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X_uncat)

# note: n_iter is deprecated in newer scikit-learn releases (renamed max_iter)
tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=1000)
X_tsne = tsne.fit_transform(X_uncat.toarray())
In [92]:
# X_uncat, kmeans_labels, X_tsne
print(f"X_uncat is a sparse matrix representation of the descriptions: {type(X_uncat)}")
print(f"kmeans_labels is an array of cluster labels: {type(kmeans_labels)}")
print(f"X_tsne is an array of t-SNE coordinates: {type(X_tsne)}")
X_uncat is a sparse matrix representation of the descriptions: <class 'scipy.sparse._csr.csr_matrix'>
kmeans_labels is an array of cluster labels: <class 'numpy.ndarray'>
X_tsne is an array of t-SNE coordinates: <class 'numpy.ndarray'>
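The choice of n_clusters=5 above was fixed by hand; a silhouette sweep is one way to sanity-check it. A sketch on synthetic blob data standing in for the TF-IDF matrix (the cluster counts and seed are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D stand-in for the TF-IDF matrix (the real X_uncat is sparse and high-dimensional)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better, range [-1, 1]

best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```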
In [93]:
plt.figure(figsize=(10, 7))
scatter = plt.scatter(
    X_tsne[:, 0], 
    X_tsne[:, 1], 
    c=kmeans_labels, 
    cmap='viridis', 
    alpha=0.7
)
plt.colorbar(scatter, label='Cluster Label')
plt.title("t-SNE Clustering of Uncategorized Transactions")
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.show()
[Figure: t-SNE clustering of uncategorized transactions]
In [94]:
test_df = test_df.copy()  # work on a copy to avoid SettingWithCopyWarning
test_df['cluster'] = kmeans_labels

for c_id in sorted(test_df['cluster'].unique()):
    print(f"\nCluster {c_id}:")
    sample_rows = test_df[test_df['cluster'] == c_id].head(5)
    for desc in sample_rows['description']:
        print(" ", desc)
Cluster 0:
  CHECK111
  From Savings - 7762
  From Savings - 7762
  From Savings - 7762
  From Savings - 7762

Cluster 1:
  Maryse Hemant FROM Maryse Hemant RASOALEJANDRE ON 08/10 REF # BACJFCCGPI30
  Maryse Hemant FROM Maryse Hemant Maryse Hemant ON 09/07 REF # BACHVG7CR09W
  Maryse Hemant FROM Maryse Hemant RASOALEJANDRE ON 08/10 REF # BACLZXBKSJ7G
  Maryse Hemant FROM Maryse Hemant ON 07/24 REF # BACMBPJUFW5N
  Maryse Hemant FROM Maryse Hemant ON 07/09 REF # NAV0HW6363TB THANKS

Cluster 2:
  Myra Gosia FROM Myra Gosia ON 07/15 REF # PP0RDWM74Z
  Myra Gosia FROM Myra Gosia ON 09/08 REF # PP0RK4F4QZ
  Myra Gosia FROM Myra Gosia ON 07/03 REF # PP0RD3YGVY BILLS
  Myra Gosia FROM Myra Gosia ON 07/15 REF # PP0RDWM74Z
  Myra Gosia FROM Myra Gosia ON 07/30 REF # BWS0HWRE20ZB PIZZA

Cluster 3:
  Empower
  RTP Credit RCVD from Empower
  Empower RTP CREDIT
  Empower RTP CREDIT
  Empower RTP CREDIT

Cluster 4:
  360 Checking Card Adjustment Signature (Credit) TARGET COM 3600 MN
  360 Checking Card Adjustment Signature (Credit) TARGET COM 3600 MN
  Insta Cash Repayment
  CHECK CARD REFUND
  CHECK CARD REFUND

Observations from the Clusters Above¶

  • Cluster 0: Dominated by check entries and internal savings transfers (e.g., "CHECK111", "From Savings - 7762"), suggesting an Internal Account Transfer or Check Deposit grouping.
  • Clusters 1 and 2: Person-to-person transfers following a "FROM ... ON <date> REF #" pattern, pointing toward Third Party or Transfer Credit categories.
  • Cluster 3: Transactions from Empower, specifically RTP (Real-Time Payments) credits. The repeated mention of Empower suggests payments or transfers from the Empower financial service, likely loans, cash advances, or repayments.
  • Cluster 4: Card adjustments and refunds (e.g., "CHECK CARD REFUND", "360 Checking Card Adjustment"), which fit a refund or Transfer Credit style grouping.
In [95]:
# find all the records from merged_df where the description contains the Empower value in it
merged_df[merged_df['description'].str.contains("Empower", na=False)][['description','category']].sample(5)
Out[95]:
description category
136426 3168 Empower Inc 6/23 TheaHall ACH DEBIT Loans
33115 Point Of Sale Deposit - Empower Finance, InVis... Loans
164860 Empower TRANSFER 3168 Transfer Credit
22029 Transfer Empower ; "Empower Cash Advance" Loans
13669 Transfer Empower ; "Empower Cash Advance" Loans
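The cluster samples and the Empower lookup above suggest simple keyword rules for tentatively labeling uncategorized transactions. A hedged sketch; the patterns and target categories below are illustrative assumptions, not validated mappings:

```python
import re

# Hypothetical keyword rules suggested by the cluster samples (illustrative only)
RULES = [
    (r"\bEMPOWER\b", "Loans"),
    (r"\bREFUND\b|\bADJUSTMENT\b", "Transfer Credit"),
    (r"\bFROM SAVINGS\b", "Internal Account Transfer"),
]

def tentative_label(description: str) -> str:
    """Return a tentative category for an uncategorized description."""
    for pattern, label in RULES:
        if re.search(pattern, description.upper()):
            return label
    return "Uncategorized"

print(tentative_label("Empower RTP CREDIT"))
print(tentative_label("From Savings - 7762"))
print(tentative_label("Mystery merchant"))
```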

Exploratory Data Analysis¶

Univariate-Analysis¶

Visualization of category and amount column

In [96]:
sns.set_style("darkgrid")
sns.set_palette("deep")
category_counts = merged_df['category'].value_counts().head(15)  # top 15 categories


fig, axes = plt.subplots(1, 2, figsize=(15, 8))

# Bar chart of the top 15 categories
sns.countplot(data=merged_df, x='category', order=category_counts.index, ax=axes[0])
axes[0].set_title('Distribution of Categories')
axes[0].set_xlabel('Category')
axes[0].set_ylabel('Frequency')
axes[0].tick_params(axis='x', rotation=70)

axes[1].pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%', textprops={'fontsize': 12})
axes[1].set_title('Proportion of Categories')


plt.tight_layout()
plt.show()
[Figure: category distribution bar chart and proportion pie chart]

Analysis of amount column

In [97]:
print(f"Kurtosis of amount: {merged_df['amount'].kurtosis()}")
print(f"Skewness of amount: {merged_df['amount'].skew()}")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(
    data=merged_df, 
    x='amount', 
    kde=True,       # Show KDE (kernel density estimate) curve
    color='blue',   
    ax=axes[0]
)
axes[0].set_title('Distribution of Amount')


sns.boxplot(
    data=merged_df, 
    x='amount', 
    color='orange', 
    ax=axes[1]
)
axes[1].set_title('Box Plot of Amount')

plt.tight_layout()
plt.show()
Kurtosis of amount: 2430.822253700687
Skewness of amount: 10.38726423364106
[Figure: histogram and box plot of amount]

The amount column shows a very uneven distribution:

  1. High Skewness (10.39):

    • This number indicates a strong right skew, meaning most values are small, but there are a few very large amounts.
  2. Very High Kurtosis (2430.82):

    • This suggests that there are many outliers and that most data points cluster around lower amounts, with a few very high values stretching the tail.

Why It’s Important:

  • This imbalance can distort average calculations, making standard statistical measures unreliable. We may need to apply transformations (like taking the log) or deal with outliers to improve the analysis and modeling.

Data Transformation¶

In [98]:
from sklearn.preprocessing import PowerTransformer
sns.set_theme(style="darkgrid", palette="deep")


# signed log transform (the +1.5 offset avoids log(0); np.sign keeps each value's direction)
merged_df['log_amount'] = np.sign(merged_df['amount']) * np.log(np.abs(merged_df['amount']) + 1.5)


print("=== LOG TRANSFORMATION APPLIED ===")
print(f"Kurtosis of log_amount: {merged_df['log_amount'].kurtosis()}")
print(f"Skewness of log_amount: {merged_df['log_amount'].skew()}\n")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(
    data=merged_df, 
    x='log_amount', 
    kde=True,
    color='blue',   
    ax=axes[0]
)
axes[0].set_title('Log-Transformed Amount Distribution')

sns.boxplot(
    data=merged_df, 
    x='log_amount', 
    color='orange', 
    ax=axes[1]
)
axes[1].set_title('Log-Transformed Amount Box Plot')

plt.tight_layout()
plt.show()


#--------------------------------------------------------------------------
#  YEO-JOHNSON TRANSFORMATION 
#--------------------------------------------------------------------------
pt = PowerTransformer(method='yeo-johnson', standardize=True)
merged_df['yj_amount'] = pt.fit_transform(merged_df[['amount']])

print("=== YEO-JOHNSON TRANSFORMATION APPLIED ===")
print(f"Kurtosis of yj_amount: {merged_df['yj_amount'].kurtosis()}")
print(f"Skewness of yj_amount: {merged_df['yj_amount'].skew()}\n")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(
    data=merged_df, 
    x='yj_amount', 
    kde=True,
    color='blue',   
    ax=axes[0]
)
axes[0].set_title('Yeo-Johnson Amount Distribution')

sns.boxplot(
    data=merged_df, 
    x='yj_amount', 
    color='orange', 
    ax=axes[1]
)
axes[1].set_title('Yeo-Johnson Amount Box Plot')

plt.tight_layout()
plt.show()
=== LOG TRANSFORMATION APPLIED ===
Kurtosis of log_amount: -0.15163612062119158
Skewness of log_amount: 0.6870711783624583

[Figure: log-transformed amount distribution and box plot]
=== YEO-JOHNSON TRANSFORMATION APPLIED ===
Kurtosis of yj_amount: 4388.583833520956
Skewness of yj_amount: -19.531379583528942

[Figure: Yeo-Johnson amount distribution and box plot]

Distribution plot for no of words in each document

In [99]:
def count_words(sentence):
    # count word tokens, skipping tokens that are purely digits
    tokens = re.findall(r'\b(?!\d+\b)\w+\b', sentence)
    return len(tokens)
In [100]:
merged_df['count'] = merged_df['description'].apply(count_words)

# plot the distribution of the number of words in each description
plt.figure(figsize=(8, 8))
sns.histplot(merged_df['count'], kde=True)  # distplot is deprecated in recent seaborn
plt.xlim(0, 40)
plt.xlabel('Number of words', fontsize=16)
plt.title('Distribution of the number of words', fontsize=18)
plt.show()
[Figure: distribution of word counts per description]

We can see that most of the descriptions contain between 2 and 10 words


Bi-variate Analysis and Multi-variate Analysis¶

  • How many users are interested in each financial goal?
  • How does the average transaction amount vary for users with different interests?
  • What is the distribution of transaction amounts across categories?
  • How often do users with certain interests (e.g., pay off debt) spend?
  • Do any of the interest flags tend to co-occur?
  • Which day(s) of the week or time of month have the highest transaction activity?
  • How do user interests intersect with transaction categories?

How many users are interested in each financial goal?

  • For each column, I can plot how many users have True vs. False. This reveals the overall distribution of interest flags in the user population.
In [101]:
interest_cols = [
    'is_interested_investment',
    'is_interested_build_credit',
    'is_interested_increase_income',
    'is_interested_pay_off_debt',
    'is_interested_manage_spending',
    'is_interested_grow_savings'
]

melted_df = merged_df[interest_cols].reset_index(drop=True)
melted_df = melted_df.melt(var_name='interest_flag', value_name='interest_value')

merged_df.head(1)
Out[101]:
client_id bank_id account_id txn_id txn_date description amount category is_interested_investment is_interested_build_credit is_interested_increase_income is_interested_pay_off_debt is_interested_manage_spending is_interested_grow_savings log_amount yj_amount count
0 1 1 1 4 2023-09-29 00:00:00 Earnin PAYMENT Donat... 20.0 Loans False False False False False False 3.068053 0.23842 4
In [102]:
plt.figure(figsize=(10, 5))
sns.countplot(
    data=melted_df,
    x='interest_flag',
    hue='interest_value'  # True/False
)
plt.title("Number of Users Interested (True/False) in Each Financial Goal")
plt.xlabel("Interest Flags")
plt.ylabel("Count of Users")
plt.legend(title="Interest Value")
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()
[Figure: user counts per interest flag]

How does the average transaction amount vary for users with different interests?

  • Box plots let us compare how transaction amounts differ between users who have a certain interest (True) and those who do not (False).
In [103]:
plt.figure(figsize=(12, 6))
sns.boxplot(
    data=merged_df,
    x='is_interested_investment',  # True/False on x-axis
    y='amount'
)
plt.title("Transaction Amount by Investment Interest (True/False)")
plt.xlabel("Is Interested in Investment?")
plt.ylabel("Transaction Amount")
plt.tight_layout()
plt.show()
[Figure: transaction amount by investment interest]

What is the distribution of transaction amounts across categories?

  • This helps us to see which categories have large or small transaction amounts.
In [104]:
plt.figure(figsize=(15, 8))
sns.boxplot(
    data=merged_df,
    x='category',
    y='amount'
)
plt.title("Distribution of Transaction Amounts by Category")
plt.xlabel("Category")
plt.ylabel("Transaction Amount")
plt.xticks(rotation=75)
plt.tight_layout()
plt.show()
[Figure: distribution of transaction amounts by category]

How often do users with certain interests (e.g., pay off debt) spend?

  • Compare the proportion of different categories for True vs. False in a pay-off debt interest column.
In [105]:
plt.figure(figsize=(15, 8))
sns.countplot(
    data=merged_df,
    x='category',
    hue='is_interested_pay_off_debt'
)
plt.title("Category Frequency by Pay-Off-Debt Interest (True/False)")
plt.xlabel("Category")
plt.ylabel("Count of Transactions")
plt.xticks(rotation=70)
plt.legend(title="Is Interested in Paying Off Debt?")
plt.tight_layout()
plt.show()
[Figure: category frequency by pay-off-debt interest]

Do any of the interest flags tend to co-occur?

  • A correlation heatmap shows whether interest flags tend to co-occur, for example whether users who want to “grow savings” also tend to want to “manage spending.”
In [106]:
interest_df = merged_df[interest_cols].astype(int)  # Convert True/False to 1/0
corr_matrix = interest_df.corr()

plt.figure(figsize=(15, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    cmap='Blues',
    fmt=".2f",
    square=True
)
plt.title("Correlation Heatmap of User Interest Flags")
plt.tight_layout()
plt.show()
[Figure: correlation heatmap of user interest flags]

Which day(s) of the week or time of month have the highest transaction activity?

In [107]:
merged_df['txn_date'] = pd.to_datetime(merged_df['txn_date'], errors='coerce')  
merged_df['day_of_week'] = merged_df['txn_date'].dt.day_name()  #  Monday, Tuesday

plt.figure(figsize=(8, 5))
sns.countplot(
    data=merged_df,
    x='day_of_week',
    order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
)
plt.title("Transaction Counts by Day of the Week")
plt.xlabel("Day of Week")
plt.ylabel("Count of Transactions")
plt.tight_layout()
plt.show()
[Figure: transaction counts by day of week]
In [108]:
# For day of month, if you prefer:
merged_df['day_of_month'] = merged_df['txn_date'].dt.day
plt.figure(figsize=(10, 5))
sns.countplot(data=merged_df, x='day_of_month')
plt.title("Transaction Counts by Day of Month")
plt.xlabel("Day of Month")
plt.ylabel("Count of Transactions")
plt.tight_layout()
plt.show()
[Figure: transaction counts by day of month]

How do user interests intersect with transaction categories?

  • By creating a cross-tabulation of category versus interest flag counts, we can determine if certain categories are particularly dominant among interested and not-interested user groups.
In [109]:
pivot_data = pd.crosstab(merged_df['category'], merged_df['is_interested_investment'])
# pivot_data has rows as categories and columns as True/False (0/1 if we cast)

plt.figure(figsize=(8, 8))
sns.heatmap(
    pivot_data, 
    annot=True, 
    fmt='d', 
    cmap='Blues'
)
plt.title("Cross-Tab Heatmap: Category vs. Investment Interest")
plt.ylabel("Category")
plt.xlabel("Is Interested in Investment (False=0, True=1)")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
 
In [ ]:
 

WordCloud from the description Column

  • A WordCloud helps visualize the most frequent words in the description text.
In [120]:
# Combine all descriptions into one string, handling nulls
text = " ".join(str(desc) for desc in merged_df['description'].dropna())

wc_stopwords = set(STOPWORDS)  # avoid shadowing nltk.corpus.stopwords imported above

wordcloud = WordCloud(
    width=800, 
    height=400,
    background_color='white',
    stopwords=wc_stopwords,
    max_words=200  # limit the number of words shown
).generate(text)

# Display the generated image
plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")  # turn off axis lines/ticks
plt.title("WordCloud of Transaction Descriptions")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
 

Feature Engineering¶

1. Extract day-of-week, day, month, and year values

In [121]:
from sklearn.pipeline import FunctionTransformer


def parse_txn_date(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['txn_date'] = pd.to_datetime(df['txn_date'], errors='coerce')
    df['day_of_week'] = df['txn_date'].dt.dayofweek
    df['day_of_month'] = df['txn_date'].dt.day
    df['month'] = df['txn_date'].dt.month
    df['year'] = df['txn_date'].dt.year
    
    df.drop(columns=['txn_date'], inplace=True, errors='ignore')
    return df

parse_date_transformer = FunctionTransformer(parse_txn_date, validate=False)
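As a quick sanity check, the same logic can be run on a tiny hypothetical frame; an unparseable date becomes `NaT`, so its derived columns come out as `NaN`. The function is repeated here so the snippet runs standalone:

```python
import pandas as pd

def parse_txn_date(df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the transformer above: derive calendar features,
    # then drop the raw date column.
    df = df.copy()
    df['txn_date'] = pd.to_datetime(df['txn_date'], errors='coerce')
    df['day_of_week'] = df['txn_date'].dt.dayofweek
    df['day_of_month'] = df['txn_date'].dt.day
    df['month'] = df['txn_date'].dt.month
    df['year'] = df['txn_date'].dt.year
    df.drop(columns=['txn_date'], inplace=True, errors='ignore')
    return df

toy = pd.DataFrame({'txn_date': ['2023-01-02', '2023-02-14', 'not-a-date']})
out = parse_txn_date(toy)
print(out)
```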
In [410]:
 

2. Drop unnecessary columns

In [122]:
UNNECESSARY_COLS = [
    'client_id', 'bank_id', 'account_id', 'txn_id',
    'count', 'scaled_data', 'transformed_amount_yj',
    'log_amount', 'yj_amount',
    'year'  # if it’s the same value for all rows
]

def drop_unnecessary_columns(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in UNNECESSARY_COLS:
        if col in df.columns:
            df.drop(columns=[col], inplace=True, errors='ignore')
    return df

drop_cols_transformer = FunctionTransformer(drop_unnecessary_columns, validate=False)
In [392]:
 
In [ ]:
 

3. Convert the is_interested columns to 0/1 integer flags

In [124]:
def convert_booleans(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    interested_cols = [c for c in df.columns if c.startswith('is_interested')]
    for col in interested_cols:
        # map truthy/falsy values to 1/0
        df[col] = df[col].astype(bool).astype(int)
    return df

bool_transformer = FunctionTransformer(convert_booleans, validate=False)
In [ ]:
 

4. Transform the amount column with a signed log transformation

In [125]:
def log_transform_amount(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # log transform: sign(amount) * log(abs(amount) + 1.5)
    if 'amount' in df.columns:
        df['amount'] = np.sign(df['amount']) * np.log(np.abs(df['amount']) + 1.5)
    return df

log_transformer = FunctionTransformer(log_transform_amount, validate=False)
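To see what the signed log does, here is a minimal sketch on made-up amounts: debits stay negative, credits stay positive, and large magnitudes are compressed, while the `+ 1.5` offset keeps the logarithm's argument strictly positive.

```python
import numpy as np

def signed_log(amount):
    # sign(x) * log(|x| + 1.5), matching the transformer above
    return np.sign(amount) * np.log(np.abs(amount) + 1.5)

amounts = np.array([-2500.0, -10.0, 0.0, 10.0, 2500.0])
transformed = signed_log(amounts)
print(transformed)
```

The transform is antisymmetric, so a $100 debit and a $100 credit map to values of equal magnitude and opposite sign.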
In [ ]:
 

5. Text vectorization of the description column, and why PCA helps here

  • Remove stopwords and punctuation, and lowercase all text.
  • Lemmatize each token to reduce inflected forms to a common base word.
  • Tokenize the corpus and vectorize the words with TF-IDF.
In [126]:
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def clean_sentence(sentence: str) -> str:
    if not isinstance(sentence, str):
        return ""
    punct = re.escape(string.punctuation)
    # \b(?!\d+\b)\w+\b matches word-like tokens that are not pure digits;
    # [punct] matches a single punctuation character so it can be filtered out below
    tokens = re.findall(r'\b(?!\d+\b)\w+\b|[' + punct + r']', sentence)
    return " ".join(t for t in tokens if not re.match('[' + punct + ']', t))


def remove_stopwords(text: str) -> str:
    tokens = [word.lower() for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(tokens)


def get_wordnet_pos(treebank_tag: str) -> str:
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return wn.NOUN

def lemmatize_text(text: str) -> str:
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    lemmatized_tokens = []
    for token, pos_ in pos_tags:
        wordnet_tag = get_wordnet_pos(pos_)
        lemma = lemmatizer.lemmatize(token.lower(), pos=wordnet_tag)
        lemmatized_tokens.append(lemma)
    return ' '.join(lemmatized_tokens)


def clean_and_lemmatize_description(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    if 'description' in df.columns:
        # collapse repeated whitespace (this also strips leading/trailing spaces)
        df['description'] = df['description'].apply(lambda x: ' '.join(str(x).split()))
        df['description'] = df['description'].apply(clean_sentence)
        df['description'] = df['description'].apply(remove_stopwords)
        df['description'] = df['description'].apply(lemmatize_text)
    return df

model_data  = merged_df.copy()
model_data = clean_and_lemmatize_description(model_data)
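A quick run of the cleaning steps on a made-up bank description shows the effect. A tiny local stop-word set stands in for the NLTK list, and lemmatization is omitted so the snippet needs no NLTK downloads:

```python
import re
import string

STOP_WORDS = {'pos', 'store'}  # tiny stand-in for stopwords.words('english')

def clean_sentence(sentence: str) -> str:
    if not isinstance(sentence, str):
        return ""
    punct = re.escape(string.punctuation)
    # keep word-like tokens that are not pure digits; drop punctuation
    tokens = re.findall(r'\b(?!\d+\b)\w+\b|[' + punct + r']', sentence)
    return " ".join(t for t in tokens if not re.match('[' + punct + ']', t))

def remove_stopwords(text: str) -> str:
    return " ".join(w.lower() for w in text.split() if w.lower() not in STOP_WORDS)

desc = "POS DEBIT #4821 - STARBUCKS STORE 0123, NY"
cleaned = remove_stopwords(clean_sentence(desc))
print(cleaned)  # digits, punctuation, and stop words are gone
```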
In [414]:
 
In [127]:
tfidf = TfidfVectorizer(stop_words='english', lowercase=False, max_features=1000)
tfidf.fit(model_data['description'])  # fit on the cleaned descriptions, not the raw ones

df_vector = tfidf.transform(model_data['description']).toarray()
print(f'shape of the vector : {df_vector.shape}')


# use PCA to reduce dimensionality
pca = PCA(random_state=42)
pca.fit(df_vector)

# Explained variance for different number of components
fig, axes = plt.subplots(1, 2, figsize=(20, 5))

# for all components
axes[0].plot(np.cumsum(pca.explained_variance_ratio_))
axes[0].set_title('PCA - cumulative explained variance vs all components')
axes[0].set_xlabel('number of components')
axes[0].set_ylabel('cumulative explained variance')
axes[0].axhline(y=0.8, color='red', linestyle='--')
axes[0].axvline(x=100, color='green', linestyle='--')

# for zoomed to first 100 components
axes[1].plot(np.cumsum(pca.explained_variance_ratio_[:100]))
axes[1].set_title('PCA - cumulative explained variance vs first 100 components')
axes[1].set_xlabel('number of components')
axes[1].set_ylabel('cumulative explained variance')
axes[1].axhline(y=0.8, color='red', linestyle='--')
axes[1].axvline(x=100, color='green', linestyle='--')


plt.tight_layout()
plt.show()
shape of the vector : (229130, 1000)
No description has been provided for this image

Note: the first 100 components already explain more than 80% of the variance.
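Rather than reading the component count off the plot, `PCA` also accepts a float `n_components`, in which case it keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic low-rank data standing in for the TF-IDF matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# low-rank synthetic matrix plus a little noise, standing in for TF-IDF features
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 300))
X += 0.01 * rng.normal(size=(500, 300))

pca = PCA(n_components=0.80, random_state=42)  # keep >= 80% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_, X_reduced.shape)
```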

In [128]:
clean_desc_transformer = FunctionTransformer(clean_and_lemmatize_description, validate=False)
In [ ]:
 

Feature Engineering Pipeline¶

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


def get_feature_pipeline():
    # Pipeline steps:
    #   Date parse -> Drop columns -> Boolean conversion -> Log transform -> Clean description
    feature_preprocessing = Pipeline([
        ('parse_date', parse_date_transformer),
        ('drop_cols', drop_cols_transformer),
        ('bool_convert', bool_transformer),
        ('log_amt', log_transformer),
        ('clean_desc', clean_desc_transformer)
    ])
    numeric_cols = ['amount', 'day_of_week', 'day_of_month', 'month'] 
    binary_cols = []  
    text_col = 'description'
    numeric_transformer = Pipeline([
        ('scaler', RobustScaler())
    ])
    
    text_transformer = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', lowercase=False, max_features=1000))
    ])

    
    final_preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_cols),
            ('text', text_transformer, text_col),
            ('binary', 'passthrough', binary_cols),
        ],
        remainder='drop'
    )
    
    # create a chain final_preprocessor -> PCA, PCA will operate on the combined numeric + TF-IDF matrix
    pipeline_full = Pipeline([
        ('custom_steps', feature_preprocessing),
        ('final_preprocessor', final_preprocessor),
        ('pca', PCA(n_components=100, random_state=42))
    ])
    
    return pipeline_full
In [ ]:
le = LabelEncoder()
merged_df['category'] = le.fit_transform(merged_df['category'])
X = merged_df.drop(columns=['category'], errors='ignore')
y = merged_df['category'].copy()
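The fitted `LabelEncoder` should be kept (or saved alongside the pipeline), since `inverse_transform` is the only way to map the integer predictions printed later back to category names. A minimal round trip with hypothetical categories:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(['Loans', 'Transfers', 'Restaurants', 'Loans'])
print(list(codes))                         # integer codes, assigned in alphabetical order
print(list(enc.inverse_transform(codes)))  # original category names
```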

Train Test Split

In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,
    random_state=42
)
In [ ]:
 

Build and Fit the Pipeline

In [ ]:
# create instance of the pipeline
feature_pipeline = get_feature_pipeline()

# fit the pipeline
feature_pipeline.fit(X_train)


# transform both train and test sets
X_train_transformed = feature_pipeline.transform(X_train)
X_test_transformed = feature_pipeline.transform(X_test)

# test_df: the unlabeled transaction set prepared earlier
test_df.drop(columns=['category'], inplace=True, errors='ignore')
test_df_transformed = feature_pipeline.transform(test_df)


print("Train shape after pipeline:", X_train_transformed.shape)
print("Test shape after pipeline:", X_test_transformed.shape)
print("test_df shape after pipeline:", test_df_transformed.shape)
Train shape after pipeline: (160391, 100)
Test shape after pipeline: (68739, 100)
test_df shape after pipeline: (29649, 100)
In [ ]:
 

Save Feature Engineering Pipeline

In [63]:
# Save Feature Engineering Pipeline
joblib.dump(feature_pipeline, 'models/feature_engineering_pipeline.pkl')
Out[63]:
['models/feature_engineering_pipeline.pkl']

Models Building, Training, and Evaluation¶

1. Helper Functions for Metrics & Plots¶

In [ ]:
 
Helper function to evaluate model performance¶
In [20]:
def evaluate_model(model, X, y):
    y_pred = model.predict(X)
    y_proba = model.predict_proba(X)
    
    try:
        roc_auc = roc_auc_score(y, y_proba, multi_class='ovr', average='macro')
    except ValueError:
        # if there's only 1 class in y, fallback gracefully
        roc_auc = float('nan')
    
    from sklearn.metrics import accuracy_score
    acc = accuracy_score(y, y_pred)
    cr = classification_report(y, y_pred)
    cm = confusion_matrix(y, y_pred)
    
    return {
        'accuracy': acc,
        'roc_auc_ovr': roc_auc,
        'classification_report': cr,
        'confusion_matrix': cm,
        'y_pred': y_pred
    }
In [ ]:
 

Helper function to plot the ROC curve¶

In [21]:
 
In [22]:
def plot_roc_curve_multi_class(model, X, y, ax=None, title="ROC Curve"):
    """Plot per-class, micro-average, and macro-average ROC curves."""
    y_score = model.predict_proba(X)
    classes = np.unique(y)
    n_classes = len(classes)
    y_bin = label_binarize(y, classes=classes)
    
    fpr = {}
    tpr = {}
    roc_auc = {}
    
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_bin[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    
    # for micro-average
    fpr["micro"], tpr["micro"], _ = roc_curve(y_bin.ravel(), y_score.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
    
    # for macro-average
    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
    mean_tpr = np.zeros_like(all_fpr)
    for i in range(n_classes):
        mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
    mean_tpr /= n_classes
    
    fpr["macro"] = all_fpr
    tpr["macro"] = mean_tpr
    roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
    
    if ax is None:
        fig, ax = plt.subplots()
    
    ax.plot(fpr["micro"], tpr["micro"],
            label='micro-average ROC (area = {0:0.2f})'
                  ''.format(roc_auc["micro"]),
            color='deeppink', linestyle=':', linewidth=4)
    
    ax.plot(fpr["macro"], tpr["macro"],
            label='macro-average ROC (area = {0:0.2f})'
                  ''.format(roc_auc["macro"]),
            color='navy', linestyle=':', linewidth=4)
    
    # Plot ROC curve for each class
    for i in range(n_classes):
        ax.plot(fpr[i], tpr[i], lw=2, label='Class {0} (area = {1:0.2f})'
                                           ''.format(classes[i], roc_auc[i]))
    
    ax.plot([0, 1], [0, 1], 'k--', lw=2)
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(title)
    ax.legend(loc="lower right")
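The macro-average above first interpolates each class's TPR onto a shared FPR grid with `np.interp`, then averages. A small numeric sketch with two made-up per-class curves:

```python
import numpy as np

# two per-class ROC curves sampled on different FPR grids (made-up values)
fpr_a, tpr_a = np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.8, 1.0])
fpr_b, tpr_b = np.array([0.0, 0.25, 1.0]), np.array([0.0, 0.6, 1.0])

all_fpr = np.unique(np.concatenate([fpr_a, fpr_b]))  # shared grid
mean_tpr = (np.interp(all_fpr, fpr_a, tpr_a) +
            np.interp(all_fpr, fpr_b, tpr_b)) / 2
print(all_fpr)
print(mean_tpr)
```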
In [ ]:
 

2. Train Multiple Models¶

In [23]:
models = {}
results = {}

1. Random Forest¶

In [24]:
model_name = "Random Forest"
print("Training Random Forest...")
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_transformed, y_train)
models['RandomForest'] = rf
print("Random Forest Done")
Training Random Forest...
Random Forest Done
In [25]:
print(f"\n=== Evaluating {model_name} on Train Set ===")
train_eval = evaluate_model(rf, X_train_transformed, y_train)
print("Accuracy:", train_eval['accuracy'])
print("ROC AUC (macro):", train_eval['roc_auc_ovr'])
print("Classification Report:\n", train_eval['classification_report'])

print(f"\n=== Evaluating {model_name} on Test Set ===")
test_eval = evaluate_model(rf, X_test_transformed, y_test)
print("Accuracy:", test_eval['accuracy'])
print("ROC AUC (macro):", test_eval['roc_auc_ovr'])
print("Classification Report:\n", test_eval['classification_report'])
# save the trained model to the models folder
joblib.dump(rf, f"models/{model_name}_model.pkl")


if test_df_transformed is not None and test_df_transformed.shape[0] > 0:
    print(f"\n=== {model_name} Predictions on test_df_transformed ===")
    pred_uncat = rf.predict(test_df_transformed)
    print("Predicted categories (first 10):", pred_uncat[:10])
    
results[model_name] = {
    'train_eval': train_eval,
    'test_eval': test_eval
}
=== Evaluating Random Forest on Train Set ===
Accuracy: 0.9976120854661421
ROC AUC (macro): 0.9999945187388483
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3970
           1       1.00      1.00      1.00      4528
           2       1.00      1.00      1.00       148
           3       1.00      1.00      1.00      2233
           4       1.00      1.00      1.00     13041
           5       0.98      0.97      0.98      1401
           6       1.00      0.99      1.00      3445
           7       0.99      1.00      0.99     18697
           8       1.00      1.00      1.00      9043
           9       0.99      0.97      0.98       193
          10       1.00      1.00      1.00      1228
          11       1.00      1.00      1.00       196
          12       1.00      1.00      1.00      8388
          13       1.00      1.00      1.00     13724
          14       1.00      1.00      1.00        29
          15       1.00      1.00      1.00      5670
          16       0.99      0.99      0.99       637
          17       1.00      1.00      1.00      5193
          18       1.00      1.00      1.00     11725
          19       1.00      1.00      1.00         3
          20       1.00      0.99      1.00       111
          21       1.00      1.00      1.00     20100
          22       1.00      1.00      1.00      4392
          23       1.00      1.00      1.00     15093
          24       1.00      1.00      1.00     10580
          25       1.00      1.00      1.00      3483
          26       1.00      0.99      0.99       257
          27       0.99      0.98      0.98      2883

    accuracy                           1.00    160391
   macro avg       1.00      1.00      1.00    160391
weighted avg       1.00      1.00      1.00    160391


=== Evaluating Random Forest on Test Set ===
Accuracy: 0.9046247399583933
ROC AUC (macro): 0.9555875199392778
Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1702
           1       0.99      0.99      0.99      1940
           2       1.00      0.92      0.96        63
           3       0.74      0.58      0.65       957
           4       0.79      0.87      0.83      5589
           5       0.67      0.45      0.54       601
           6       0.89      0.88      0.89      1477
           7       0.84      0.85      0.85      8013
           8       0.75      0.77      0.76      3876
           9       0.64      0.43      0.52        83
          10       0.79      0.71      0.75       526
          11       0.92      0.98      0.95        84
          12       1.00      0.99      1.00      3595
          13       0.95      0.96      0.95      5881
          14       0.79      0.92      0.85        12
          15       0.94      0.96      0.95      2430
          16       0.70      0.47      0.57       273
          17       0.88      0.79      0.83      2225
          18       0.84      0.83      0.83      5025
          19       0.00      0.00      0.00         2
          20       0.72      0.58      0.64        48
          21       0.98      0.98      0.98      8614
          22       0.95      0.92      0.94      1883
          23       0.98      0.99      0.99      6468
          24       0.99      0.99      0.99      4534
          25       0.96      0.97      0.96      1493
          26       0.84      0.75      0.79       110
          27       0.82      0.78      0.80      1235

    accuracy                           0.90     68739
   macro avg       0.83      0.80      0.81     68739
weighted avg       0.90      0.90      0.90     68739


=== Random Forest Predictions on test_df_transformed ===
Predicted categories (first 10): [18 18 13 21 13 13 13 13 13 13]
In [26]:
# plot roc for both training and testing
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
plot_roc_curve_multi_class(rf, X_train_transformed, y_train, ax=axes[0], title=f"{model_name} - Train ROC")
plot_roc_curve_multi_class(rf, X_test_transformed, y_test, ax=axes[1], title=f"{model_name} - Test ROC")
plt.tight_layout()
plt.show()
No description has been provided for this image

2. Logistic Regression¶

In [27]:
print("Training Logistic Regression...")
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lr.fit(X_train_transformed, y_train)
models['LogisticRegression'] = lr
print("Logistic Regression Done")
Training Logistic Regression...
Logistic Regression Done
In [28]:
model_name  = "Logistic Regression"
print(f"\n=== Evaluating {model_name} on Train Set ===")
train_eval = evaluate_model(lr, X_train_transformed, y_train)
print("Accuracy:", train_eval['accuracy'])
print("ROC AUC (macro):", train_eval['roc_auc_ovr'])
print("Classification Report:\n", train_eval['classification_report'])

print(f"\n=== Evaluating {model_name} on Test Set ===")
test_eval = evaluate_model(lr, X_test_transformed, y_test)
print("Accuracy:", test_eval['accuracy'])
print("ROC AUC (macro):", test_eval['roc_auc_ovr'])
print("Classification Report:\n", test_eval['classification_report'])
# save the trained model to the models folder
joblib.dump(lr, f"models/{model_name}_model.pkl")


if test_df_transformed is not None and test_df_transformed.shape[0] > 0:
    print(f"\n=== {model_name} Predictions on test_df_transformed ===")
    pred_uncat = lr.predict(test_df_transformed)
    print("Predicted categories (first 10):", pred_uncat[:10])
    
results[model_name] = {
    'train_eval': train_eval,
    'test_eval': test_eval
}
=== Evaluating Logistic Regression on Train Set ===
Accuracy: 0.765342195010942
ROC AUC (macro): 0.9560158440257897
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.99      0.97      3970
           1       0.95      0.94      0.94      4528
           2       0.76      0.74      0.75       148
           3       0.61      0.32      0.42      2233
           4       0.55      0.67      0.60     13041
           5       0.59      0.10      0.17      1401
           6       0.71      0.53      0.61      3445
           7       0.52      0.75      0.62     18697
           8       0.44      0.38      0.41      9043
           9       0.00      0.00      0.00       193
          10       0.42      0.11      0.17      1228
          11       0.86      0.62      0.72       196
          12       0.95      0.92      0.93      8388
          13       0.85      0.82      0.83     13724
          14       0.00      0.00      0.00        29
          15       0.85      0.89      0.87      5670
          16       0.00      0.00      0.00       637
          17       0.87      0.59      0.71      5193
          18       0.73      0.59      0.65     11725
          19       0.00      0.00      0.00         3
          20       0.80      0.14      0.24       111
          21       0.93      0.94      0.94     20100
          22       0.78      0.69      0.73      4392
          23       0.93      0.96      0.94     15093
          24       0.93      0.96      0.94     10580
          25       0.90      0.86      0.88      3483
          26       1.00      0.03      0.06       257
          27       0.76      0.63      0.69      2883

    accuracy                           0.77    160391
   macro avg       0.67      0.54      0.56    160391
weighted avg       0.77      0.77      0.76    160391


=== Evaluating Logistic Regression on Test Set ===
Accuracy: 0.7650678654039191
ROC AUC (macro): 0.9532511274695313
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.99      0.97      1702
           1       0.94      0.94      0.94      1940
           2       0.83      0.70      0.76        63
           3       0.62      0.29      0.40       957
           4       0.56      0.68      0.61      5589
           5       0.61      0.09      0.16       601
           6       0.70      0.54      0.61      1477
           7       0.52      0.75      0.62      8013
           8       0.44      0.40      0.42      3876
           9       0.00      0.00      0.00        83
          10       0.37      0.09      0.15       526
          11       0.85      0.52      0.65        84
          12       0.94      0.92      0.93      3595
          13       0.85      0.82      0.83      5881
          14       0.00      0.00      0.00        12
          15       0.85      0.89      0.87      2430
          16       0.00      0.00      0.00       273
          17       0.88      0.57      0.69      2225
          18       0.72      0.59      0.65      5025
          19       0.00      0.00      0.00         2
          20       1.00      0.10      0.19        48
          21       0.93      0.94      0.94      8614
          22       0.79      0.69      0.74      1883
          23       0.93      0.96      0.94      6468
          24       0.93      0.96      0.94      4534
          25       0.91      0.87      0.89      1493
          26       1.00      0.02      0.04       110
          27       0.73      0.62      0.67      1235

    accuracy                           0.77     68739
   macro avg       0.67      0.53      0.56     68739
weighted avg       0.77      0.77      0.76     68739


=== Logistic Regression Predictions on test_df_transformed ===
Predicted categories (first 10): [18 18 13 23  4  4  4  4 21 13]
In [29]:
# plot roc for both training and testing
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
plot_roc_curve_multi_class(lr, X_train_transformed, y_train, ax=axes[0], title=f"{model_name} - Train ROC")
plot_roc_curve_multi_class(lr, X_test_transformed, y_test, ax=axes[1], title=f"{model_name} - Test ROC")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
 
In [ ]:
 

3. GaussianNB (Naive Bayes)¶

In [30]:
print("Training GaussianNB...")
mnb = GaussianNB()
mnb.fit(X_train_transformed, y_train)
models['GaussianNB'] = mnb
print("GaussianNB Done")
Training GaussianNB...
GaussianNB Done
In [31]:
model_name = "GaussianNB"
print(f"\n=== Evaluating {model_name} on Train Set ===")
train_eval = evaluate_model(mnb, X_train_transformed, y_train)
print("Accuracy:", train_eval['accuracy'])
print("ROC AUC (macro):", train_eval['roc_auc_ovr'])
print("Classification Report:\n", train_eval['classification_report'])

print(f"\n=== Evaluating {model_name} on Test Set ===")
test_eval = evaluate_model(mnb, X_test_transformed, y_test)
print("Accuracy:", test_eval['accuracy'])
print("ROC AUC (macro):", test_eval['roc_auc_ovr'])
print("Classification Report:\n", test_eval['classification_report'])
# save the trained model to the models folder
joblib.dump(mnb, f"models/{model_name}_model.pkl")

if test_df_transformed is not None and test_df_transformed.shape[0] > 0:
    print(f"\n=== {model_name} Predictions on test_df_transformed ===")
    pred_uncat = mnb.predict(test_df_transformed)
    print("Predicted categories (first 10):", pred_uncat[:10])
    
results[model_name] = {
    'train_eval': train_eval,
    'test_eval': test_eval
}
=== Evaluating GaussianNB on Train Set ===
Accuracy: 0.5495071419219283
ROC AUC (macro): 0.9196774599226875
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.83      0.89      3970
           1       0.89      0.90      0.89      4528
           2       0.77      0.87      0.82       148
           3       0.26      0.38      0.31      2233
           4       0.65      0.36      0.46     13041
           5       0.14      0.49      0.22      1401
           6       0.47      0.59      0.52      3445
           7       0.73      0.38      0.50     18697
           8       0.57      0.27      0.36      9043
           9       0.04      0.20      0.07       193
          10       0.13      0.19      0.15      1228
          11       0.89      0.61      0.72       196
          12       0.85      0.81      0.83      8388
          13       0.68      0.63      0.65     13724
          14       0.02      0.55      0.03        29
          15       0.79      0.69      0.74      5670
          16       0.01      0.33      0.02       637
          17       0.86      0.54      0.66      5193
          18       0.61      0.27      0.38     11725
          19       0.06      1.00      0.11         3
          20       0.06      0.70      0.10       111
          21       0.78      0.58      0.67     20100
          22       0.43      0.51      0.47      4392
          23       0.83      0.67      0.74     15093
          24       0.74      0.74      0.74     10580
          25       0.81      0.76      0.79      3483
          26       0.04      0.25      0.07       257
          27       0.17      0.75      0.27      2883

    accuracy                           0.55    160391
   macro avg       0.51      0.57      0.47    160391
weighted avg       0.70      0.55      0.60    160391


=== Evaluating GaussianNB on Test Set ===
Accuracy: 0.5500952879733485
ROC AUC (macro): 0.8995635053938663
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.82      0.88      1702
           1       0.88      0.90      0.89      1940
           2       0.76      0.86      0.81        63
           3       0.24      0.34      0.28       957
           4       0.67      0.37      0.47      5589
           5       0.15      0.51      0.23       601
           6       0.47      0.60      0.52      1477
           7       0.73      0.38      0.50      8013
           8       0.58      0.28      0.37      3876
           9       0.04      0.22      0.07        83
          10       0.11      0.18      0.14       526
          11       0.94      0.54      0.68        84
          12       0.86      0.82      0.84      3595
          13       0.67      0.63      0.65      5881
          14       0.02      0.58      0.03        12
          15       0.78      0.70      0.74      2430
          16       0.01      0.31      0.02       273
          17       0.87      0.52      0.65      2225
          18       0.60      0.28      0.38      5025
          19       0.00      0.00      0.00         2
          20       0.05      0.54      0.08        48
          21       0.79      0.59      0.67      8614
          22       0.45      0.53      0.49      1883
          23       0.84      0.66      0.74      6468
          24       0.73      0.75      0.74      4534
          25       0.83      0.74      0.78      1493
          26       0.03      0.19      0.06       110
          27       0.17      0.74      0.27      1235

    accuracy                           0.55     68739
   macro avg       0.51      0.52      0.46     68739
weighted avg       0.70      0.55      0.60     68739


=== GaussianNB Predictions on test_df_transformed ===
Predicted categories (first 10): [13 13 13 16  3  3  3  3 15 15]
In [32]:
# plot roc for both training and testing
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
plot_roc_curve_multi_class(mnb, X_train_transformed, y_train, ax=axes[0], title=f"{model_name} - Train ROC")
plot_roc_curve_multi_class(mnb, X_test_transformed, y_test, ax=axes[1], title=f"{model_name} - Test ROC")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
 

Models Performance Comparison and Results Interpretation¶

In [53]:
model_performance = {}

for model_name, model_data in results.items():
    test_eval = model_data['test_eval']
    
    y_pred = test_eval['y_pred']
    y_true = y_test 

    # derive micro-averaged precision/recall/F1 from the confusion matrix;
    # for single-label multiclass these all collapse to overall accuracy
    conf_matrix = confusion_matrix(y_true, y_pred)
    tp = np.diag(conf_matrix)
    fn = conf_matrix.sum(axis=1) - tp
    fp = conf_matrix.sum(axis=0) - tp
    tn = conf_matrix.sum() - (tp + fn + fp)

    precision = tp.sum() / (tp.sum() + fp.sum()) if (tp.sum() + fp.sum()) > 0 else 0
    recall = tp.sum() / (tp.sum() + fn.sum()) if (tp.sum() + fn.sum()) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    roc_auc = test_eval['roc_auc_ovr']  
    model_performance[model_name] = {
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score,
        'roc_auc_ovr': roc_auc
    }
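Summing TP, FP, and FN over all classes gives the micro-averaged metrics, and in single-label multiclass classification every false positive for one class is a false negative for another, so micro precision, recall, and F1 all equal plain accuracy; that is why the three metrics print identical values per model below. A small numeric check on a made-up 3-class confusion matrix:

```python
import numpy as np

cm = np.array([[5, 1, 0],
               [2, 7, 1],
               [0, 1, 3]])  # rows = true class, columns = predicted class

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp

micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
accuracy = tp.sum() / cm.sum()
print(micro_precision, micro_recall, accuracy)  # all three are identical
```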
In [54]:
# show the performance of every model
for model_name, metrics in model_performance.items():
    print(f"\nPerformance for {model_name}:")
    for metric_name, value in metrics.items():
        print(f"{metric_name}: {value:.4f}")
Performance for Random Forest:
precision: 0.9046
recall: 0.9046
f1_score: 0.9046
roc_auc_ovr: 0.9556

Performance for Logistic Regression:
precision: 0.7651
recall: 0.7651
f1_score: 0.7651
roc_auc_ovr: 0.9533

Performance for GaussianNB:
precision: 0.5501
recall: 0.5501
f1_score: 0.5501
roc_auc_ovr: 0.8996

Model Performance Comparison Plot on Test Data¶

In [55]:
model_names = list(results.keys()) 
metrics = ['accuracy', 'roc_auc_ovr', 'precision', 'recall', 'f1_score']
metric_values = {metric: [] for metric in metrics}

for model in model_names:
    test_eval = results[model]['test_eval']
    metric_values['accuracy'].append(test_eval['accuracy'])
    metric_values['roc_auc_ovr'].append(test_eval['roc_auc_ovr'])
    
    conf_matrix = test_eval['confusion_matrix']
    # micro-averaged precision, recall, and F1 from the confusion matrix
    tp = np.diag(conf_matrix)  
    fn = conf_matrix.sum(axis=1) - tp 
    fp = conf_matrix.sum(axis=0) - tp  
    tn = conf_matrix.sum() - (tp + fn + fp)  
    
    precision = tp.sum() / (tp.sum() + fp.sum())
    recall = tp.sum() / (tp.sum() + fn.sum())
    f1_score = 2 * (precision * recall) / (precision + recall)
    
    metric_values['precision'].append(precision)
    metric_values['recall'].append(recall)
    metric_values['f1_score'].append(f1_score)


metric_values
Out[55]:
{'accuracy': [0.9046247399583933, 0.7650678654039191, 0.5500952879733485],
 'roc_auc_ovr': [0.9555875199392778, 0.9532511274695313, 0.8995635053938663],
 'precision': [0.9046247399583933, 0.7650678654039191, 0.5500952879733485],
 'recall': [0.9046247399583933, 0.7650678654039191, 0.5500952879733485],
 'f1_score': [0.9046247399583932, 0.7650678654039191, 0.5500952879733485]}
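A note on the identical columns above: pooling TP, FP and FN over all classes before dividing (micro-averaging) makes precision, recall and therefore F1 all collapse to plain accuracy, because in a multiclass confusion matrix every false positive for one class is simultaneously a false negative for another. A minimal sketch with a toy confusion matrix (the numbers are illustrative, not the notebook's):

```python
import numpy as np

# Toy 3-class confusion matrix (rows = true class, columns = predicted class).
conf_matrix = np.array([
    [50,  3,  2],
    [ 4, 40,  6],
    [ 1,  5, 39],
])

tp = np.diag(conf_matrix)               # true positives per class
fn = conf_matrix.sum(axis=1) - tp       # false negatives per class
fp = conf_matrix.sum(axis=0) - tp       # false positives per class

# Micro-averaging: pool TP/FP/FN over all classes before dividing.
precision = tp.sum() / (tp.sum() + fp.sum())
recall = tp.sum() / (tp.sum() + fn.sum())
accuracy = tp.sum() / conf_matrix.sum()

# Every off-diagonal cell is counted once as an FP and once as an FN,
# so fp.sum() == fn.sum() and all three quantities coincide.
assert precision == recall == accuracy
```

To see genuine per-class differences, macro-averaged metrics (for example via `sklearn.metrics.classification_report`) are the usual alternative.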
In [56]:
palette = sns.color_palette("husl", len(model_names))  
fig, ax = plt.subplots(figsize=(14, 8))
bar_width = 0.15 
x_indexes = np.arange(len(metrics))

for i, model in enumerate(model_names):
    metric_values_for_model = [metric_values[metric][i] for metric in metrics]
    ax.bar(
        x_indexes + i * bar_width,
        metric_values_for_model,
        width=bar_width,
        label=model,
        color=palette[i],
        edgecolor='black',
        alpha=0.9
    )

ax.set_xlabel("Metrics", fontsize=14, labelpad=10)
ax.set_ylabel("Values", fontsize=14, labelpad=10)
ax.set_title("Model Performance Comparison", fontsize=18, weight='bold', pad=20)
ax.set_xticks(x_indexes + bar_width)
ax.set_xticklabels(metrics, fontsize=12, weight='bold')
ax.legend(fontsize=12, title="Models", loc='upper left', bbox_to_anchor=(1, 1))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# add values on top of the bars
for i, model in enumerate(model_names):
    for j, metric in enumerate(metrics):
        ax.text(
            x_indexes[j] + i * bar_width,
            metric_values[metric][i] + 0.01,  
            f"{metric_values[metric][i]:.2f}",
            ha='center',
            fontsize=10,
            color='black',
            weight='bold'
        )

# background color for customizing the plot
fig.patch.set_facecolor('#f7f7f7')
ax.set_facecolor('#f7f7f7')

plt.tight_layout()
plt.show()
[Plot: "Model Performance Comparison" grouped bar chart of accuracy, ROC-AUC (OVR), precision, recall and F1 for each model]

Interpretation of Results¶

The Random Forest model is clearly the best performer among the three models based on the provided metrics:

    1. Performance Metrics Comparison

    | Model               | Precision | Recall | F1-Score | ROC-AUC-OVR |
    |---------------------|-----------|--------|----------|-------------|
    | Random Forest       | 0.9046    | 0.9046 | 0.9046   | 0.9556      |
    | Logistic Regression | 0.7651    | 0.7651 | 0.7651   | 0.9533      |
    | GaussianNB          | 0.5501    | 0.5501 | 0.5501   | 0.8996      |

    2. Key Observations
    • Random Forest:
      • Achieves the highest precision, recall, F1-score, and ROC-AUC across all models.
      • It handles imbalanced data better due to its ability to learn complex relationships and decision boundaries. This makes it effective for both majority and minority classes.
      • Why it’s best: The model balances performance across all metrics, making it reliable for both accurate and balanced predictions.
    • Logistic Regression:
      • Performs reasonably well, especially in terms of ROC-AUC (0.9533), which is close to Random Forest.
      • However, its precision, recall, and F1-score are significantly lower, likely due to its inability to model complex, non-linear relationships inherent in the dataset.
    • GaussianNB:
      • Performs the worst across all metrics.
      • Naive Bayes assumes feature independence, which is likely violated in this dataset where amount, description, and other features interact in complex ways.
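The feature-independence point can be checked empirically: on synthetic data where most columns are redundant (correlated) combinations of a few informative ones, GaussianNB's independence assumption is violated while a tree ensemble is largely unaffected. A minimal sketch on synthetic data (dataset and split are illustrative, not the notebook's transactions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# 8 of 10 features are linear combinations of the 2 informative ones,
# i.e. strongly correlated features that violate Naive Bayes' assumption.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=2,
                           n_redundant=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb_acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
rf_acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"GaussianNB: {nb_acc:.3f}  Random Forest: {rf_acc:.3f}")
```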

Model Explainability and Interpretability using LIME and PDPs¶

  • Why We Used These:

    • LIME: To explain why the model made a specific prediction for a single transaction. It’s great for understanding individual decisions.
    • PartialDependenceDisplay: To see how a feature, like “amount” or “description,” affects predictions across all transactions. It helps find overall patterns.
  • When to Use These:

    • LIME: When you want to explain or debug the model’s decision for a single case, like why a transaction was labeled “Loans.”
    • PartialDependenceDisplay: When you want to understand how a feature impacts predictions across the whole dataset.
  • What the Outputs Are:

    • LIME: Shows which features were most important for a single prediction and how they influenced the result (e.g., positively or negatively).
    • PartialDependenceDisplay: Creates graphs that show how a feature affects predictions overall (average effect) and for individual cases (variability).

LIME

In [58]:
# LIME explains individual predictions over the transformed (numeric) features
from lime.lime_tabular import LimeTabularExplainer

# generic names for the transformed feature columns
feature_names = [f"Feature {i}" for i in range(X_train_transformed.shape[1])]

# build a LimeTabularExplainer on X_train_transformed with those feature names
explainer = LimeTabularExplainer(
    training_data=X_train_transformed,
    feature_names=feature_names,
    class_names=[str(cls) for cls in np.unique(y_train)],
    mode="classification"
)

i = 0
instance = X_test_transformed[i]

# explain the prediction of instance i
explanation = explainer.explain_instance(
    data_row=instance,
    predict_fn=rf.predict_proba,  
    num_features=20, 
    top_labels=1 
)

explanation.show_in_notebook(show_table=True)  

# save the explanation to a file 
explanation.save_to_file('lime_explanation.html')

Partial Dependence Plots (PDPs)

In [ ]:
feature_indices = [0, 1, 2]  # first three transformed features (illustrative choice)
target_class = 0             # class whose predicted probability is inspected
fig, axes = plt.subplots(len(feature_indices), 1, figsize=(10, 5 * len(feature_indices)))
axes = np.atleast_1d(axes)  # stays indexable even when only one feature is plotted

for i, feature_idx in enumerate(feature_indices):
    PartialDependenceDisplay.from_estimator(
        rf,
        X_test_transformed,
        features=[feature_idx],
        target=target_class,
        kind="both",                                  # ICE curves plus averaged PDP
        ax=axes[i],
        line_kw={"color": "blue"},                    # individual ICE lines
        pd_line_kw={"color": "red", "linewidth": 2},  # average partial dependence
    )
    axes[i].set_title(f"PDP and ICE for Feature {feature_idx} (Target Class: {target_class})", fontsize=14)
    axes[i].set_ylabel("Predicted Value", fontsize=12)
    axes[i].set_xlabel(f"Feature {feature_idx}", fontsize=12)

plt.tight_layout()
plt.show()
[Plot: PDP (red) and ICE (blue) curves for Features 0-2, target class 0]

Prediction Based on New User Data¶

In [ ]:
# load feature engineering pipeline and best model
feature_pipeline = joblib.load("models/feature_engineering_pipeline.pkl")
best_model = joblib.load("models/Random Forest_model.pkl")
In [69]:
new_user_data = pd.DataFrame({
    "txn_date": ["2025-01-01"],  
    "description": ["Payment to ABC Store"], 
    "amount": [10.5], 
    "is_interested_investment": [0], 
    "is_interested_build_credit": [1],
    "is_interested_increase_income": [0],
    "is_interested_pay_off_debt": [1],
    "is_interested_managed_spending": [1],
    "is_interested_grow_savings": [0]
})

# apply the feature pipeline to the new user data
new_user_data_transformed = feature_pipeline.transform(new_user_data)

# get the prediction and probabilities
predicted_category = best_model.predict(new_user_data_transformed)
predicted_proba = best_model.predict_proba(new_user_data_transformed)

# show the predicted label index and the per-class probabilities
print("Predicted Category:", predicted_category[0])
print("Prediction Probabilities:", predicted_proba)
Predicted Category: 13
Prediction Probabilities: [[0.   0.01 0.   0.01 0.04 0.01 0.05 0.03 0.01 0.04 0.04 0.01 0.   0.22
  0.   0.13 0.01 0.03 0.01 0.   0.   0.16 0.01 0.03 0.07 0.06 0.01 0.01]]

Convert the predicted label back to its original category name

In [70]:
predicted_category = le.inverse_transform(predicted_category)
predicted_category
Out[70]:
array(['Loans'], dtype=object)
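Since `predict_proba` returns a full probability vector, the single decoded label can be extended to a ranked shortlist of likely categories. A minimal sketch, with a toy `LabelEncoder` and probability row standing in for the notebook's fitted `le` and `predicted_proba` (category names and values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the notebook's fitted encoder and predict_proba output.
le = LabelEncoder().fit(["Loans", "Restaurants", "Transfers", "Utilities"])
proba = np.array([[0.22, 0.13, 0.55, 0.10]])  # hypothetical class probabilities

# Rank class indices by probability, highest first, and decode the top 3.
top_k = np.argsort(proba[0])[::-1][:3]
shortlist = [(le.inverse_transform([i])[0], float(proba[0][i])) for i in top_k]
print(shortlist)  # highest-probability categories first
```

Surfacing a shortlist rather than one label is useful when the top probability is low (0.22 for "Loans" in the cell above), since the second-ranked category may be the right one.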

Steps to Enhance Model Performance¶

  1. Hyperparameter Tuning: Optimize the Random Forest model's settings using GridSearchCV or RandomizedSearchCV.

  2. Address Class Imbalance: Utilize SMOTE or class weighting to better manage imbalanced categories during training.

  3. Feature Engineering: Implement word embeddings for the description column to enrich text data and analyze feature importance to create impactful features.

  4. Ensemble Models: Combine different models like Random Forest, XGBoost, and Logistic Regression using techniques like stacking or blending to maximize strengths.

  5. Deploy the Model: Save the best model and set up APIs for real-time predictions.

  6. Monitor and Iterate: Regularly collect new data, retrain the model, and track its performance for continuous improvement.
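Steps 1 and 2 can be combined in one pass: a randomized search over a Random Forest trained with `class_weight="balanced"`. A minimal sketch on synthetic imbalanced data; the parameter grid and dataset are illustrative, not tuned for the transaction dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced stand-in for the transformed transaction features.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=42)

# class_weight="balanced" addresses step 2; the randomized search, step 1.
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions={
        "n_estimators": [100, 200, 400],
        "max_depth": [None, 10, 20],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=5, cv=3, scoring="f1_macro", random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring with `f1_macro` (rather than accuracy) weights every category equally, which matters when rare classes like the minority transaction types must not be ignored.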


Code for Notebook Customization¶

In [1]:
from IPython.core.display import HTML

style = """
    <style>
        body {
            background-color: #f2fff2;
        }
        h1 {
            text-align: center;
            font-weight: bold;
            font-size: 36px;
            color: #4295F4;
            text-decoration: underline;
            padding-top: 15px;
        }
        
        h2 {
            text-align: left;
            font-weight: bold;
            font-size: 30px;
            color: #4A000A;
            text-decoration: underline;
            padding-top: 10px;
        }
        
        h3 {
            text-align: left;
            font-weight: bold;
            font-size: 30px;
            color: #f0081e;
            text-decoration: underline;
            padding-top: 5px;
        }

        
        p {
            text-align: center;
            font-size: 12px;
            color: #0B9923;
        }
    </style>
"""

html_content = """
<h1>Hello</h1>
<p>Hello World</p>
<h2> Hello</h2>
<h3> World </h3>
"""

HTML(style + html_content)
Out[1]:

Hello

Hello World

Hello

World
